Abstract: The authors of the study “The Long-Term Impacts of Teachers” claim that their study 
shows that increases in teacher value-added lead to significant and lasting increases in test scores and 
significant increases in income that will last throughout adulthood. Instead, I show that these claims 
are false because they are contradicted by the findings of the study itself. In fact, the results of the 
Chetty et al. study raise serious questions about the benefits of using the value-added method for 
evaluating teachers. 
Keywords: value-added models (VAM); teacher evaluation 


Introduction 

The study, “The Long-Term Impacts of Teachers: Teacher Value-Added and Student 
Outcomes in Adulthood,” by Raj Chetty, John N. Friedman and Jonah E. Rockoff (Chetty et al., 
2011) has not yet been published in an academic journal, but it has nevertheless received wide 
attention. One of the authors, Raj Chetty, an economist at Harvard, received a 2012 MacArthur 
Foundation award for his “rigorous theoretical and empirical studies [that] are informing the design 
of effective government policy” and the foundation singled out “The Long-Term Impacts of 
Teachers.” The foundation’s statement read: “Chetty and colleagues found that, adjusting for other 
factors, students who by chance were assigned to talented teachers in elementary school had 
significantly higher incomes as adults and better future life outcomes more generally. By asking 
simple, penetrating questions and developing rigorous theoretical and empirical tests, Chetty’s 
timely, often surprising, findings in applied economics are illuminating key policy issues of our time” 
(The MacArthur Foundation, 2012). 

The New York Times covered the study on its front page (Lowery, 2012) and President Obama 
relied on it in his 2012 State of the Union Address when he asserted “a great teacher can offer an 
escape from poverty to the child who dreams beyond his circumstance” (Massari, 2012). The value- 
added idea became a major bone of contention during the 2012 Chicago teachers' strike, when many 
opinion pieces about the strike made reference to the Chetty et al. study (Strauss, 2012). In New 
Zealand the Treasury Department posted the article on its website. 

There is just one problem: as we explain below, the study does not show what the authors 
claim it shows. It does not show a long-term impact on earnings by high "value-added" teachers, nor 
does it show a lasting impact on test scores. But there have also been other concerns raised about 
the study; these will be reviewed first. 

Michael Winerip, a New York Times reporter, highlighted an issue with the study that the 
authors themselves recognized: The adults in the study were children in the 1990s, a time when the 
students' test results had no effect on the pay of their teachers (Winerip, 2012). If teachers are 
judged by their students' test results, they may change the way they teach and resort to “teaching to 
the test.” As Winerip points out, the teachers who would 
become the most successful at “teaching to the test” may not be the same teachers who were the 
most successful in raising test scores when incentives for raising test scores did not exist. It is 
therefore unclear whether the association between higher test scores and better outcomes that 
Chetty et al. found would hold once teachers' pay depends on their value-added scores. 

The author wishes to thank Tom Russell, Ellen Adler, Alec Meiklejohn, and Henry Braun for helpful 
suggestions. Special thanks to Sarah Polasky and anonymous referees for their many helpful comments. 
1 http://www.treasury.govt.nz/downloads/pdfs/gen-conf-14dec11-chetty.pdf 

Richard Rothstein, a research associate at the Economic Policy Institute and a former 
national education columnist for The New York Times, has pointed out that teachers who do not excel 
at raising their students' grades on standardized tests may excel at teaching other skills, such as 
cooperative behavior and social skills, which, although perhaps antithetical to competitive 
behavior, are nevertheless crucial for success in adulthood (Rothstein, 2012). Thus discouraging or 
perhaps even firing teachers with low value-added scores may hurt, not help, students. 

Professor Dale Ballou of Vanderbilt University reviewed the Chetty et al. study on behalf of 
the National Education Policy Center (Ballou, 2012), but his review is flawed. Ballou raises the 
possibility that unaccounted-for factors explain both the more successful 
adults and the higher value-added of their teachers in the Chetty et al. study. Of course, such a 
possibility always exists; this is the reason for the adage “correlation does not imply causation.” But 
the mere possibility is not sufficient to rule out causation. To rule out causation, an actual factor 
must be identified such that, once accounted for, the correlation would disappear. The factor that 
Ballou chose to illustrate his criticism, good parenting, would not have such an effect. According to 
Ballou, good parents know both how to secure high value-added teachers for their children and how 
to prepare them for greater success in adult life. While this may be true, if the value-added of 
teachers does not affect adult results, why would good parents seek high value-added teachers for 
their children? Ballou's review therefore sheds little light on the validity of the Chetty et al. study. 

Winerip, Rothstein, and Ballou focus on issues that are missing from the Chetty et al. study 
and argue that because of these issues incorporating value-added into teacher evaluations may not be 
useful and may even be harmful. We now turn to an examination of the Chetty et al. study itself. 

Teacher Value-Added and Lifetime Income 

Regarding the impact of teacher value-added on income, the Chetty et al. data yielded two 
results: 1) An increase of one standard deviation in teacher value-added for one school year in 
childhood leads to an increase in annual income of $182 at age 28; and 2) Teacher value-added in 
childhood has no effect on annual income at age 30. The only meaningful conclusion that can be 
drawn from these results is that a substantial increase in the value-added of a teacher for one year in 
elementary school leads to a small increase in income in early adulthood, but that this increase 
disappears by age 30. But this is not the conclusion the authors reached. 

Chetty and his co-authors report that in their data the probability that teacher value-added 
does not have an effect on adult income at age 28 is less than 1%. In other words, the effect of 
teacher value-added on income at age 28 is “statistically significant.” However, although the result 
they found for 30 year olds is not statistically significant, the words “not statistically significant” are 
nowhere to be found in their study. Instead the authors write, “The 95% confidence interval for the 
estimate is very wide. We therefore focus on earnings impacts up to age 28 for the remainder of our 
analysis.” 2 The terms “statistically significant” and “confidence interval” are explained in the 
statistical terminology section below, but in plain English their statement means that they did not 
find conclusive evidence that teacher value-added affects the income of 30 year olds. Furthermore, 
they didn't just “focus” on earning impacts up to age 28; instead they proceeded as if the result for 
28 year olds, an increase in income of 0.9%, was also the result that they found for 30 year olds, and 
made the assumption that this would have also been the result for any subsequent age group. Based 
on this assumption, which is contradicted by their own evidence, they calculated a life-time benefit 
of $25,000 from an increase of one standard deviation in teacher value-added. 

2 Chetty et al., page 39. 

The authors claim that the reason for the “very wide” interval (translation: not statistically 
significant result) is that the sample of 30 year olds was smaller than the sample of 28 year olds. All 
else equal, a smaller sample produces a wider interval, but this is beside the point. Samples almost always 
differ in size, but this is not license to throw away the results from smaller samples at will. Before 
they conduct a study, researchers typically calculate whether the sample is sufficiently large and 
decide whether to proceed. After conducting a study its authors cannot ignore the results; science 
does not permit cherry picking. The result that teacher value-added does not have a statistically 
significant impact on earnings at age 30 must be part of any conclusion drawn from the Chetty et al. 
study. 

Nevertheless, did Chetty et al. miscalculate the sample size they needed for their study? Was 
their sample of 61,639 30 year olds too small? The calculation of sample size requires some 
choices and assumptions; 3 using reasonable choices and assumptions the required sample size would 
have been 8,124. 4 Chetty et al. therefore did not miscalculate when they decided to perform the test 
for 30 year olds. 
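
The order of magnitude of the required sample can be checked with a textbook normal approximation for detecting a regression slope. The sketch below is not Lenth's program: the function name and the formula n ≈ (z_alpha + z_power)² · VIF · (error SD / (beta · SD of x))² are assumptions made for illustration, and the approximation ignores the number of predictors. With the parameters listed in footnote 4 it should land close to the two-tailed and one-tailed figures reported there, and in both cases far below the actual sample of 61,639.

```python
from math import ceil
from statistics import NormalDist


def approx_sample_size(beta, sd_x, sd_error, alpha=0.05, power=0.95,
                       two_tailed=True, vif=1.0):
    """Normal-approximation sample size for detecting a regression slope.

    Illustrative textbook approximation (an assumption of this sketch),
    not the algorithm used by Lenth's program.
    """
    z = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    z_alpha = z.inv_cdf(1 - tail)   # critical value of the test
    z_power = z.inv_cdf(power)      # quantile for the desired power
    n = (z_alpha + z_power) ** 2 * vif * (sd_error / (beta * sd_x)) ** 2
    return ceil(n)


# Parameters taken from the footnote: detectable beta 2,000,
# SD of teacher value-added 0.1, error SD $5,000, alpha .05, power .95.
two_tailed_n = approx_sample_size(beta=2000, sd_x=0.1, sd_error=5000)
one_tailed_n = approx_sample_size(beta=2000, sd_x=0.1, sd_error=5000,
                                  two_tailed=False)
actual_sample = 61639

print(two_tailed_n, one_tailed_n, actual_sample > two_tailed_n)
```

Because the approximation uses normal rather than t quantiles and omits the predictor count, small differences from the program's output are to be expected.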


3 The basics of sample size calculations are explained in the statistical terminology appendix. 

4 Sample size was calculated using the program of Professor Russell V. Lenth of the University of Iowa: 
http://homepage.stat.uiowa.edu/~rlenth/Power/index.html. This program is recommended by the NIH, 
which is interested in minimizing the number of animals used in medical research. See: Ralph B. Dell, Steve 
Holleran, and Rajasekhar Ramakrishnan, “Sample Size Determination,” Institute for Laboratory Animal 
Research Journal, 2002; 43(4): 207-213, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3275906/ 

The list of parameters used is as follows: 

i. Detectable Beta (smallest meaningful value that can be detected): 2,000. I chose this value because the 
coefficient that is not statistically significant in the Chetty et al. equation for 30 year olds is $2,058 (the 
standard deviation is $1,953). (Chetty et al. Table 6, page 76.) 

ii. Standard Deviation of Teacher Value-Added: 0.1. This is the value that Chetty et al. found. 

iii. Number of Predictors: 26. Chetty et al. do not provide the actual equation they estimate in Table 6. In a 
note to Figure 1 they list all 25 explanatory variables that they use in their study, and this (plus an intercept) is 
the number we use for the calculation. Because sample size increases with the number of predictors, this is a 
conservative specification. 

iv. Variance Inflation Factor: 1. This value assumes that there is no multicollinearity among the explanatory 
variables. Chetty et al. do not indicate that there is multicollinearity in their data. Nevertheless, we calculated 
the sample size also with a Variance Inflation Factor of 5, and in this case the sample size is 41,020, still some 
20,000 observations smaller than the actual sample. 

v. Alpha, the desired significance level of the test: .05. 

vi. Two tailed: Yes. Because of the possibility that students of high value-added teachers would earn less than 
students of low value-added teachers (see the discussion of Richard Rothstein's (Rothstein, 2012) critique 
above), a two tailed test is indicated. For a one tail test the required sample is smaller, 6,765. 

vii. Error Standard Deviation: $5,000. Chetty et al. do not provide summary statistics for age 30. The standard 
deviation of annual income at age 28 is $23,782. This permits the standard deviation of the error to be as 
large as 1/5 of the standard deviation of annual income. 

viii. Power: .95 

Findings vs. interpretation in “The long-term impacts of teachers” by Chetty et al. 

Long-Lasting Test Score Gains 

An obvious question is how an increase in teacher value-added for just one year in 
elementary school can affect earnings in adulthood. Given that the increase in income at age 28 is 
only $182 a year and that at age 30 there is no increase, this question may not really matter. 
Nevertheless, Chetty and his co-authors dealt with this issue in a way that deserves further 
investigation. 

A teacher's value-added is measured by the effect that he or she has on his or her students' 
test scores. Therefore, if teacher value-added affects income in adulthood, the connection must be 
either directly through the student's higher test scores themselves, or through some other change— 
perhaps in work habits—that the teacher caused and that brought about both the increase in the 
student's scores in childhood and the increase in earnings later in life. But studies that preceded 
“The Long-Term Impacts of Teachers” discovered that increases in test scores that are due to 
teachers fade out after just a few years (Rothstein, 2010; Kane and Staiger, 2008; Jacob, Lefgren, and 
Sims, 2010). 5 If the effect of a high-value-added teacher does not last even through a child's years in 
elementary school, how can it last for the rest of the child's life? “The Long-Term Impacts of 
Teachers” appears to avoid this problem because in this study Chetty and his co-authors 
report that a high teacher value-added leads to “long-lasting test score gains.” But, once 
again, this is not what their study shows. 

What the study does show is that an increase of one standard deviation in teacher value-added 
results in an increase of about 0.03 standard deviations in student test scores three years later. If the 
standard deviation of test scores is 27% (the standard deviation is 27% in SAT (Stanford 
Achievement Test) tests and 26% in the BSF (Tennessee Basic Skills First) tests; Krueger, 1999, p. 
531), 0.03 of that value is 0.8%, less than 1%. In other words, after three years the effect of the 
high value-added teacher on test scores vanishes. But this is not how Chetty and his co-authors 
present the “fade out.” Instead, they write: “In our data, the impact of a one SD increase in teacher 
quality stabilizes at approximately 0.3 SD after three years, showing that students assigned to 
teachers with higher VA achieve long-lasting test score gains.” While this may have been an honest 
mistake, when asked about this, one of the authors, Jonah Rockoff, responded: “This is definitely a 
language error on our part. Instead of 'a one SD increase in teacher quality' we should have said 'an 
increase in teacher value-added of one (student-level) standard deviation.' Of course an increase of 
one in value added is roughly 10 teacher-level standard deviations, so your assessment of 0.03 in year 
three for a one teacher-level SD increase in year 0 would be correct” (Jonah Rockoff to Moshe 
Adler, personal communication, October 12, 2012). This answer is problematic, however. 

If the authors were to change the language to reflect Rockoff's correction, it would read: “In 
our data, the impact of [an increase in teacher value-added of one (student-level) standard deviation] 
stabilizes at approximately 0.3 SD after three years, showing that students assigned to teachers with 
higher VA achieve long-lasting test score gains.” But this statement would not be true because the 
probability of finding a teacher whose value-added is 10 teacher-level standard deviations above the 
value-added of other teachers is practically zero. In fact, the range of teacher value-added in the 
Chetty et al. data is from about -0.18 to +0.18, a maximum difference of about 0.36 units of 
value-added, or about 3.6 teacher-level standard deviations. 6 Students cannot be assigned to the high 
value-added teachers that the Chetty et al. statement mentions because such high value-added teachers 
simply do not exist. 
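
The unit conversion behind this argument can be made explicit. The sketch below uses the figures quoted in the text (a teacher-level standard deviation of 0.1 and a three-year impact of 0.3 SD per one-unit increase). The function names are hypothetical, and the tail probability rests on an added assumption that teacher value-added is approximately normally distributed, which the text does not state.

```python
from math import erfc, sqrt

TEACHER_SD = 0.1           # SD of teacher value-added, in student-level SD units (from the text)
GAIN_PER_UNIT_YEAR3 = 0.3  # test-score gain after three years per one-unit VA increase (from the text)


def units_to_teacher_sds(units):
    """Convert a change in value-added units into teacher-level SDs."""
    return units / TEACHER_SD


def upper_tail_probability(z):
    """P(Z > z) for a standard normal variable (normality is an assumption)."""
    return 0.5 * erfc(z / sqrt(2))


one_unit_in_teacher_sds = units_to_teacher_sds(1.0)     # a one-unit increase
gain_per_teacher_sd = GAIN_PER_UNIT_YEAR3 * TEACHER_SD  # gain for a realistic, one teacher-SD increase
p_ten_sd_teacher = upper_tail_probability(one_unit_in_teacher_sds)

print(one_unit_in_teacher_sds, gain_per_teacher_sd, p_ten_sd_teacher)
```

Under these assumptions a one-unit increase corresponds to 10 teacher-level standard deviations, and the chance of observing a teacher that far above the mean is vanishingly small.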


5 Jacob et al. (2010) found that only 25% of the first year teacher effect remains in the second year. Kane and 
Staiger (2008) found that 50% remains in the second year and 25% remains in the third. J. Rothstein (2010) 
found that 30% remains in the second year. Furthermore, Rothstein also found that the correlation between 
the 1st year effect and the 3rd year effect is only 0.4, leading him to conclude: “A teacher’s first-year effect is a 
poor proxy for his or her longer-run impact.” 

6 Chetty et al., Figure 6. 
The imaginary high value-added teacher is not a minor character in the Chetty et al. study. In 
fact, the authors draw the reader's attention to the non-existent high value-added teacher by 
producing a dramatic chart (Figure 1) that shows what an increase in teacher value-added of one unit 
would do. But, of course, this figure is false, because the maximum value-added increase possible in 
their own sample is only 0.36. The impact of increasing teacher value-added by one teacher-level 
standard deviation is shown in Figure 2. 



[Figure 1 plot not reproduced; horizontal axis: Year] 

Figure 1. Impacts of teacher value-added on lagged, current, and future test scores 
Source: Figure 2, Chetty et al. 

Note (m.a.): These are the impacts of a one-unit increase in teacher value-added at 
year 0. This increase is equivalent to a 10 teacher-level standard deviations increase in 
teacher value-added. 

Figure 2. Impacts of teacher value-added on lagged, current, and future test scores 

Source: Fantasy line: Figure 2, Chetty et al. Reality line: Author’s calculations 
Note (m.a.): The Reality line shows the impacts of a 0.1 unit increase in teacher value-added. This 
increase is equivalent to a one teacher-level standard deviation increase in teacher value-added. The 
Fantasy line is identical to the line in Figure 1. 

Conclusion 

The authors of the study “The Long-Term Impacts of Teachers” claim that their study shows 
that increases in teacher value-added lead to significant and lasting increases in test scores and 
significant increases in income that will last throughout adulthood. As this article has shown, these 
claims are false because they are contradicted by the findings of the study itself. A substantial (one 
standard deviation) increase in teacher value-added increases test scores by 2.7% in the first year and 
by 0.8% after three years. The same increase in value-added adds $182 of income at age 28, and no 
statistically significant increase in income at age 30, only two years later. Thus, the results of the 
Chetty et al. study raise serious doubts about the benefits of using the value-added method for 
evaluating teachers. Although the study does not successfully make the case for measuring the value- 
added of teachers, it does very eloquently make the case for the importance of peer review, since the 
claims made in the study have received wide attention and broad endorsement both in the United 
States and internationally. The gap between the results and their interpretation in the Chetty et al. 
study is so significant that there is little doubt that peer reviewers would have identified it. 

Appendix: Statistical Terminology 


Standard Deviation 

The measurement unit that statistical studies, including the Chetty et al. study, use is the 
standard deviation. I will explain it through an example. Suppose two students, A and B, have test 
scores of 60% and 80% respectively. Their mean grade is 70%, and their scores' 
deviations from the mean are -10% and +10% respectively. In absolute value the deviations are both 
10%, and therefore the “standard deviation” in this case is 10%. When measured in standard 
deviations, student A's score is -1 standard deviation and student B's score is +1 standard deviation. 
The difference between their scores is two standard deviations. 7 

Measuring scores in standard deviations instead of percentages has two advantages. First, 
when measured in standard deviations, student scores are the same regardless of what units the test 
actually used. The SAT test, for example, uses a scale of 0-800 points. With this scale student A's 
score would have been 480, student B's score would have been 640 and the average score would 
have been 560. But in standard deviations A's and B's scores would have still been -1.0 and +1.0, 
respectively. 
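
The two-student example can be reproduced in a few lines. This is a minimal sketch: the function name is hypothetical, and it uses the population standard deviation (`statistics.pstdev`), which matches the simplified definition used in the example above.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Express each score as a number of standard deviations from the mean."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD, as in the two-student example
    return [(s - m) / sd for s in scores]


percent_scale = standardize([60, 80])   # scores of A and B in percent
point_scale = standardize([480, 640])   # the same scores on a 0-800 scale

print(percent_scale, point_scale)
```

Both calls give the same standardized scores for students A and B, which is the sense in which the unit of the test no longer matters.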

Measuring scores in standard deviations has an additional advantage. Suppose that there are 
many students, not just two, and suppose that their test scores are distributed along a bell 
(“normal”) curve. Then a student whose score is one standard deviation above the mean is in the 
84th percentile of scores, and a student who is 1.64 standard deviations above the mean is in the 
95th percentile of scores. In other words, when scores are expressed in standard deviations, we can 
figure out the percentile rankings of the students. 
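
The percentile figures follow from the cumulative distribution function of the normal curve. The sketch below assumes, as the paragraph does, that scores are normally distributed; the helper name is hypothetical.

```python
from statistics import NormalDist


def percentile_from_sds(z):
    """Percentile rank of a score z standard deviations above the mean,
    assuming normally distributed scores."""
    return 100 * NormalDist().cdf(z)


one_sd_percentile = percentile_from_sds(1.0)
top_five_percentile = percentile_from_sds(1.64)

print(round(one_sd_percentile), round(top_five_percentile))
```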

Teacher Value-Added 

A teacher's value-added score is equal to the number of standard deviations that he or she 
adds, on average, to the test scores of his or her students. What does this mean? 

As was seen above, the standard deviation of student scores is about 27%. If the value-added 
score of a teacher is +1.0, that teacher adds, on average, one standard deviation, or 27%, to his or 
her students' test scores. Chetty et al. found that the differences in teacher value-added scores are 
small: their standard deviation is 0.1. What does this mean? Suppose that the value-added score of 
teacher B is one standard deviation above the value-added score of teacher A. Also suppose that the 
average score of the students in teacher A's class is 60%. Had teacher B been the teacher in that 
classroom instead of teacher A, then the average score of the students would have been 62.7%. 

Three years later, the effect of teacher value-added fades out. If the students who three years 
earlier were in teacher A's class score 70%, these students would have scored 70.8% on average had 
teacher B been their instructor instead of teacher A. 
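
The classroom arithmetic can be written out as follows. This is a sketch using only numbers quoted in the text (a student-level standard deviation of 27 percentage points, a teacher-level standard deviation of 0.1, and a three-year impact of 0.03 student-level standard deviations); the function and constant names are hypothetical.

```python
STUDENT_SD_POINTS = 27.0  # SD of student test scores, in percentage points (from the text)
TEACHER_SD = 0.1          # SD of teacher value-added, in student-level SDs (from the text)
YEAR3_GAIN_SDS = 0.03     # gain after three years for a one teacher-SD increase (from the text)


def counterfactual_score(baseline, gain_in_student_sds):
    """Average class score had the higher value-added teacher taught the class."""
    return baseline + gain_in_student_sds * STUDENT_SD_POINTS


same_year = counterfactual_score(60.0, 1 * TEACHER_SD)   # teacher B instead of A
three_years_later = counterfactual_score(70.0, YEAR3_GAIN_SDS)

print(round(same_year, 1), round(three_years_later, 1))
```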

Statistical Significance 

Chetty et al. estimated the rate of increase in income in adulthood due to teacher value- 
added. Such a rate cannot be estimated with absolute precision, and what scientists report instead is 
a “confidence interval” for which, by convention, the level of confidence is 95%. 

Most people are familiar with confidence intervals from public opinion polls. A pollster 
always reports the result of his or her poll with a “margin of error.” The margin is determined by 
the size of the sample and by the level of confidence that the pollster wants to have in his or her 
results. Suppose that the confidence level is 95% and that the pollster found that in his or her 
sample 52% of respondents said they would vote for candidate X. If the margin of error is +/-1%, 
he or she will predict, with 95% confidence, that X will be the winner. But if the margin of error is 
+/-3% he or she will say that the poll is “too close to call” because the confidence interval is “too 
wide.” The span from 49% to 55% contains the possibility that X will lose. Academic researchers 
would say that the result of 52% is “statistically insignificant.” 

7 When there are many observations, statisticians use a slightly different method for calculating the standard 
deviation, but for our purposes, understanding the basic concept, the explanation above is sufficient. 

The result in Chetty et al., an increase of $206 in annual income at age 30 due to an 
increase of one standard deviation in teacher value-added, is not statistically significant at the 5% 
level because the margin of error is $383. 8 The confidence interval spans from -$177 to +$589, 
and it is therefore impossible to say with a 95% level of confidence whether an 
increase in teacher value-added would lead to a loss or a gain in income at age 30. This is why Chetty 
et al. state that the confidence interval is very wide. 
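
The interval can be reconstructed from the Table 6 figures quoted earlier in the article: a coefficient of $2,058 with a standard deviation of $1,953 per unit of value-added, scaled by the teacher-level standard deviation of 0.1. The sketch below assumes a normal critical value for the 95% level; the function name is hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, confidence=0.95):
    """Return (low, high, margin) for a normal-theory confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * std_error
    return estimate - margin, estimate + margin, margin


TEACHER_SD = 0.1
estimate = 2058 * TEACHER_SD    # effect of a one teacher-SD increase, about $206
std_error = 1953 * TEACHER_SD   # its standard deviation, about $195

low, high, margin = confidence_interval(estimate, std_error)
is_significant = not (low <= 0 <= high)  # significant only if zero lies outside the interval

print(round(low), round(high), round(margin), is_significant)
```

Because the interval includes zero, the estimate cannot be distinguished from no effect at the 5% level.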

Sample Size Basics 

To understand the basic calculation involved in determining sample size I will continue with 
the election poll example. As already mentioned, the size of the margin of error depends on the size 
of the sample and the level of confidence, which by convention is set to 95%. With 1,067 
respondents, the margin of error is 3%. With 4,268 respondents the margin of error is 1.5%, and 
with 9,604 the margin of error is only 1%. Of course, if all voters are in the sample the margin of 
error is 0% and the level of confidence is 100%. Polling all voters isn't practical, however, and this is 
why pollsters use smaller samples instead. What size sample should the pollster choose? A pollster 
who accepts being unable to call the election when the leading candidate is favored by 53% of 
respondents or less would draw a sample of 1,067 respondents. A pollster 
who wants to be able to make a prediction even then would have to draw a larger sample. A similar 
choice of parameters is involved when calculating the sample size required for the Chetty et al. 
study. 
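
The relation between sample size and margin of error in the poll example can be sketched with the standard formula for a proportion, margin = z · sqrt(p(1 - p) / n), evaluated at the worst case p = 0.5. That worst-case choice and the helper names are assumptions of this illustration.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, p=0.5, confidence=0.95):
    """Margin of error for a sample proportion (worst case p = 0.5 by default)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def can_call(share, margin):
    """A pollster can call the race only if the whole interval lies above 50%."""
    return share - margin > 0.5


margins = {n: margin_of_error(n) for n in (1067, 4268, 9604)}
for n, m in margins.items():
    print(n, f"{100 * m:.1f}%")

print(can_call(0.52, margins[9604]), can_call(0.52, margins[1067]))
```

With 52% support, the race can be called at the margin given by 9,604 respondents but not at the margin given by 1,067.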


8 The standard deviation of the effect of an increase of one standard deviation in teacher 
value-added is $195 (Chetty et al., Table 6). 